Comparative Study of Word Alignment Heuristics and Phrase-Based SMT

Authors

  • Hua Wu
  • Haifeng Wang
  • Dong Cheng
Abstract

This paper compares six word alignment heuristics and their impact on translation quality. We also propose a method for filtering noise from the phrase tables extracted by these heuristics, and we examine the effectiveness of combining the heuristics. Experiments are performed on the Europarl corpus, for which a multilingual in-domain training corpus, an in-domain test set, and an out-of-domain test set are available. Results indicate that (1) the heuristics show similar tendencies in the word alignment task on both test sets, but perform differently in the translation task on the in-domain and out-of-domain test sets; (2) in general, the relationship between word alignment performance and machine translation performance is difficult to predict, since it depends on the domains of the training and test corpora as well as on other factors such as the evaluation metrics and the characteristics of the translation system; (3) noise filtering and combination of the heuristics yield larger improvements on the out-of-domain test set than on the in-domain test set.

Similar resources

Length-Incremental Phrase Training for SMT

We present an iterative technique to generate phrase tables for SMT, which is based on force-aligning the training data with a modified translation decoder. Different from previous work, we completely avoid the use of a word alignment or phrase extraction heuristics, moving towards a more principled phrase generation and probability estimation. During training, we allow the decoder to generate ...

English-French Verb Phrase Alignment in Europarl for Tense Translation Modeling

This paper presents a method for verb phrase (VP) alignment in an English/French parallel corpus and its use for improving statistical machine translation (SMT) of verb tenses. The method starts from automatic word alignment performed with GIZA++, and relies on a POS tagger and a parser, in combination with several heuristics, in order to identify non-contiguous components of VPs, and to label ...

Improving Statistical Machine Translation with Monolingual Collocation

This paper proposes to use monolingual collocations to improve Statistical Machine Translation (SMT). We make use of the collocation probabilities, which are estimated from monolingual corpora, in two aspects, namely improving word alignment for various kinds of SMT systems and improving phrase table for phrase-based SMT. The experimental results show that our method improves the performance of...

Statistical Analysis of Alignment Characteristics for Phrase-based Machine Translation

In most statistical machine translation (SMT) systems, bilingual segments are extracted via word alignment. However, a systematic study is lacking as to which alignment characteristics benefit MT under specific experimental settings such as the language pair or the corpus size. In this paper we produce a set of alignments by directly tuning the alignment model according to alignment F-score a...

Phrase alignment confidence for statistical machine translation

The performance of phrase-based statistical machine translation (SMT) systems is crucially dependent on the quality of the extracted phrase pairs, which is in turn a function of word alignment quality. Data sparsity, an inherent problem in SMT even with large training corpora, often has an adverse impact on the reliability of the extracted phrase translation pairs. In this paper, we present a n...

Publication date: 2007